Right, so let's start by importing some relevant libraries and the datasets from the previous section.
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
Quick reminder:
# Group 1:
# items1 = pd.read_csv('data/Created in part 01/group1_items.csv', index_col='Invoice', parse_dates=['InvoiceDate'], dtype={'Customer ID': str})
invoices1 = pd.read_csv('data/Created in part 01/group1_invoices.csv', index_col='Invoice', parse_dates=['InvoiceDate'], dtype={'Customer ID': str})
# Group 2:
# items2 = pd.read_csv('data/Created in part 01/group2_items.csv', index_col='Invoice', parse_dates=['InvoiceDate'], dtype={'Customer ID': str})
# invoices2 = pd.read_csv('data/Created in part 01/group2_invoices.csv', index_col='Invoice', parse_dates=['InvoiceDate'], dtype={'Customer ID': str})
# # Group 3:
# items3 = pd.read_csv('data/Created in part 01/group3_items.csv', index_col='Invoice', parse_dates=['InvoiceDate'], dtype={'Customer ID': str})
# invoices3 = pd.read_csv('data/Created in part 01/group3_invoices.csv', index_col='Invoice', parse_dates=['InvoiceDate'], dtype={'Customer ID': str})
several graphs will be created. Data scientists are good at managing data and creating all sorts of weird stuff, but all these outputs might look confusing for outsiders. Therefore, Streamlit might come in handy, since it is a powerful tool for dashboard creation.
Considering we have 3 groups of different data, it becomes too much time-consuming creating separate functions for each group. We can make things a lot easier by creating general functions that can take any input and return the appropriate output. This will also simplify our Streamlit code, making it faster to develop and easier to cache, if needed be. "By default", let's use the datasets from Group 1 as the basis for our visualizations and analyses.
Let's get things going:
invoices1
| Quantity | Price | Customer ID | InvoiceDate | |
|---|---|---|---|---|
| Invoice | ||||
| 489434 | 166 | 32.65 | 13085.0 | 2009-01-12 07:45:00 |
| 489435 | 60 | 10.50 | 13085.0 | 2009-01-12 07:46:00 |
| 489436 | 193 | 70.88 | 13078.0 | 2009-01-12 09:06:00 |
| 489437 | 145 | 83.45 | 15362.0 | 2009-01-12 09:08:00 |
| 489438 | 826 | 44.05 | 18102.0 | 2009-01-12 09:24:00 |
| ... | ... | ... | ... | ... |
| 538167 | 125 | 76.15 | 14713.0 | 2010-09-12 18:58:00 |
| 538168 | 167 | 68.95 | 14702.0 | 2010-09-12 19:23:00 |
| 538169 | 33 | 39.20 | 13230.0 | 2010-09-12 19:28:00 |
| 538170 | 133 | 70.54 | 13969.0 | 2010-09-12 19:32:00 |
| 538171 | 194 | 165.58 | 17530.0 | 2010-09-12 20:01:00 |
20573 rows × 4 columns
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Scatter(y=invoices1.Price, x=invoices1.InvoiceDate, mode='markers', name='Price'), row=1, col=1)
fig.add_trace(go.Scatter(y=invoices1.Quantity, x=invoices1.InvoiceDate, mode='markers', name='Quantity'), row=1, col=2)
fig.show('svg')
Uh-oh, looks like our data is messy. Invoices before 12/Dec/2009 were all clumped together into the 12th day of the month, therefore, we don't have available data on a daily basis.
Let's see how this would look like if we were to regroup everything into monthly data.
resampled_invoices1 = invoices1.resample(rule='M', on='InvoiceDate').sum() # grouping by month
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Bar(y=resampled_invoices1.Price, x=resampled_invoices1.index, name='Price'), row=1, col=1)
fig.add_trace(go.Bar(y=resampled_invoices1.Quantity, x=resampled_invoices1.index, name='Quantity'), row=1, col=2)
# by default, the last day of the month is shown, but the data represents the whole month
fig.show('svg')
Weird, still. To be honest, we should just drop all these data from Nov/2009 backwards.
Also, Dec/2010 seems incomplete and we should just drop it if that is the casse. Let's check.
invoices1.query("InvoiceDate > '2009-11-30'").InvoiceDate.sort_values()[:5] # querying and sorting data after 30/Nov/2009
Invoice 491648 2009-12-13 09:58:00 491649 2009-12-13 10:28:00 491651 2009-12-13 10:41:00 491652 2009-12-13 10:48:00 491653 2009-12-13 10:50:00 Name: InvoiceDate, dtype: datetime64[ns]
Indeed, data from Dec/2009 starts on the 13th...
invoices1.InvoiceDate.sort_values()[-5:]
Invoice 532639 2010-12-11 16:42:00 532641 2010-12-11 16:59:00 532642 2010-12-11 17:00:00 532643 2010-12-11 17:01:00 532644 2010-12-11 17:32:00 Name: InvoiceDate, dtype: datetime64[ns]
... and Dec/2010 only had purchases up to the 11th day...
Let's also remove both months.
invoices1 = invoices1.query("(InvoiceDate > '2009-12-31') & (InvoiceDate < '2010-12-01')").sort_values('InvoiceDate')
invoices1.head()
| Quantity | Price | Customer ID | InvoiceDate | |
|---|---|---|---|---|
| Invoice | ||||
| 496349 | 228 | 65.51 | 14739.0 | 2010-01-02 08:13:00 |
| 496351 | 79 | 80.05 | 14370.0 | 2010-01-02 08:35:00 |
| 496352 | 4 | 4.90 | NaN | 2010-01-02 09:45:00 |
| 496354 | 98 | 25.61 | 12810.0 | 2010-01-02 09:52:00 |
| 496355 | 432 | 1.69 | 16684.0 | 2010-01-02 09:55:00 |
invoices1.groupby(invoices1['InvoiceDate'].dt.month_name().str[:3], sort=False).nunique()
# converts InvoiceDate to month-name only, and then gets the first 3 characters
| Quantity | Price | Customer ID | InvoiceDate | |
|---|---|---|---|---|
| InvoiceDate | ||||
| Jan | 511 | 1227 | 915 | 1337 |
| Feb | 487 | 1125 | 813 | 1195 |
| Mar | 540 | 1333 | 969 | 1393 |
| Apr | 558 | 1367 | 973 | 1464 |
| May | 542 | 1382 | 975 | 1482 |
| Jun | 530 | 1302 | 940 | 1384 |
| Jul | 570 | 1453 | 1023 | 1628 |
| Aug | 566 | 1392 | 1002 | 1499 |
| Sep | 615 | 1596 | 1150 | 1781 |
| Oct | 593 | 1699 | 1264 | 1790 |
| Nov | 622 | 1915 | 1392 | 2057 |
Ok, columns Quantity, Price and InvoiceDate are quite irrelevant in this dataframe.
The index is wrongly named, so we need to fix that as well.
Let's get that into a variable and plot it.
Psst! We can do the same thing for the unique number of invoices!
invoices1_monthly_customers = (
invoices1[['Customer ID']]
.groupby(invoices1['InvoiceDate'].dt.month_name().str[:3], sort=False)
.nunique()
)
invoices1_monthly_orders = (
invoices1[['Price']] # selecting any column (we just need the number of entries for each month, since each entry is a unique invoice number)
.groupby(invoices1['InvoiceDate'].dt.month_name().str[:3], sort=False)
.count()
) # the column for this dataframe is named 'Price', but we know that it is not the case, right?
fig = go.Figure(data=[
go.Bar(name='Monthly Active Customers', y=invoices1_monthly_customers['Customer ID'], x=invoices1_monthly_customers.index),
go.Bar(name='Monthly Orders', y=invoices1_monthly_orders['Price'], x=invoices1_monthly_orders.index)
])
fig.show("svg")